24 research outputs found

    Profiling relational data: a survey

    Profiling data to determine metadata about a given dataset is an important and frequent activity of IT professionals and researchers and is necessary for various use cases. It encompasses a vast array of methods to examine datasets and produce metadata. Among the simpler results are statistics, such as the number of null values and distinct values in a column, its data type, or the most frequent patterns of its data values. Metadata that are more difficult to compute involve multiple columns, namely correlations, unique column combinations, functional dependencies, and inclusion dependencies. Further techniques detect conditional properties of the dataset at hand. This survey provides a classification of data profiling tasks and comprehensively reviews the state of the art for each class. In addition, we review data profiling tools and systems from research and industry. We conclude with an outlook on the future of data profiling beyond traditional profiling tasks and beyond relational databases.
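
    The single-column statistics mentioned above (null counts, distinct values, frequent value patterns) can be computed in a single pass over a column. The sketch below is not taken from the survey; the function name, the pattern encoding (digits as '9', letters as 'A'), and the sample data are illustrative assumptions.

```python
from collections import Counter
import re

def profile_column(values):
    """Simple single-column profile: null count, distinct count, top value patterns.
    Illustrative only; not an implementation from the survey."""
    nulls = sum(1 for v in values if v is None or v == "")
    non_null = [str(v) for v in values if v not in (None, "")]
    distinct = len(set(non_null))

    def pattern(s):
        # Abstract each value: digits become '9', letters become 'A', other chars kept.
        return re.sub(r"[A-Za-z]", "A", re.sub(r"\d", "9", s))

    top_patterns = Counter(pattern(v) for v in non_null).most_common(3)
    return {"nulls": nulls, "distinct": distinct, "top_patterns": top_patterns}

if __name__ == "__main__":
    phone_numbers = ["030-1234", "040-5678", None, "0511-99", "030-4321"]
    print(profile_column(phone_numbers))
    # {'nulls': 1, 'distinct': 4, 'top_patterns': [('999-9999', 3), ('9999-99', 1)]}
```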

    AutoML in Heavily Constrained Applications

    Optimizing a machine learning pipeline for a task at hand requires careful configuration of various hyperparameters, typically supported by an AutoML system that optimizes the hyperparameters for the given training dataset. Yet, depending on the AutoML system's own second-order meta-configuration, the performance of the AutoML process can vary significantly. Current AutoML systems cannot automatically adapt their own configuration to a specific use case. Further, they cannot comply with user-defined application constraints on the effectiveness and efficiency of the pipeline and its generation. In this paper, we propose Caml, which uses meta-learning to automatically adapt its own AutoML parameters, such as the search strategy, the validation strategy, and the search space, for a task at hand. The dynamic AutoML strategy of Caml takes user-defined constraints into account and obtains constraint-satisfying pipelines with high predictive performance.
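
    The abstract does not describe how Caml enforces constraints internally. Purely to illustrate the general idea of constraint-aware pipeline search, the sketch below filters candidate configurations against user-defined effectiveness and efficiency limits within a time budget; all names, thresholds, and the toy evaluation function are assumptions, not Caml's actual algorithm or API.

```python
from dataclasses import dataclass
import random
import time

@dataclass
class Constraints:
    # Hypothetical user-defined limits, loosely following the paper's notion of
    # constraints on pipeline effectiveness and efficiency.
    max_search_seconds: float   # budget for the whole search
    max_inference_ms: float     # latency limit for the resulting pipeline
    min_accuracy: float         # required predictive performance

def constrained_search(candidates, evaluate, constraints):
    """Return the best constraint-satisfying configuration found within the budget.
    `evaluate(config)` is a stand-in for a real training/validation step."""
    deadline = time.monotonic() + constraints.max_search_seconds
    best = None
    for config in candidates:
        if time.monotonic() > deadline:
            break
        accuracy, inference_ms = evaluate(config)
        if inference_ms > constraints.max_inference_ms or accuracy < constraints.min_accuracy:
            continue  # discard pipelines that violate a constraint
        if best is None or accuracy > best[1]:
            best = (config, accuracy)
    return best

if __name__ == "__main__":
    random.seed(0)

    def toy_evaluate(config):
        # Toy surrogate: accuracy and latency drawn at random per configuration.
        return random.uniform(0.7, 0.95), random.uniform(1.0, 20.0)

    configs = [{"n_estimators": n} for n in range(10, 200, 10)]
    print(constrained_search(configs, toy_evaluate, Constraints(1.0, 10.0, 0.85)))
```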

    The Need for Incorporation of the Principles of Fiscal Sociology in Social Policy in Ukraine

    The article proposes a new principle for financing social expenditures in a country with an insufficient level of democracy under conditions of economic crisis, which the author suggests calling the Pareto anti-optimum.

    Unsupervised String Transformation Learning for Entity Consolidation

    Data integration has been a long-standing challenge in data management with many applications. A key step in data integration is entity consolidation. It takes a collection of clusters of duplicate records as input and produces a single "golden record" for each cluster, which contains the canonical value for each attribute. Truth discovery and data fusion methods, as well as Master Data Management (MDM) systems, can be used for entity consolidation. However, to achieve better results, the variant values (i.e., values that are logically the same but have different formats) in the clusters need to be consolidated before applying these methods. For this purpose, we propose a data-driven method to standardize the variant values based on two observations: (1) the variant values usually can be transformed into the same representation (e.g., "Mary Lee" and "Lee, Mary") and (2) the same transformation often appears repeatedly across different clusters (e.g., transposing the first and last name). Our approach first uses an unsupervised method to generate groups of value pairs that can be transformed in the same way (i.e., they share a transformation). Then the groups are presented to a human for verification, and the approved ones are used to standardize the data. In a real-world dataset with 17,497 records, our method achieved 75% recall and 99.5% precision in standardizing variant values by asking a human 100 yes/no questions, which significantly outperformed a state-of-the-art data wrangling tool.
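
    As a rough illustration of only the final standardization step (not the paper's unsupervised grouping of value pairs or its learned transformation language), the sketch below hard-codes the name-transposition example from the abstract and applies human-approved transformations to a cluster; the function names and the regular expression are assumptions.

```python
import re

def transpose_name(value):
    """One concrete transformation from the abstract's running example:
    rewrite "Lee, Mary" into the canonical form "Mary Lee"."""
    match = re.fullmatch(r"\s*([^,]+),\s*(.+?)\s*", value)
    return f"{match.group(2)} {match.group(1).strip()}" if match else value

def standardize(values, transformations):
    """Rewrite variant values using human-approved transformations.
    Only the standardization step is shown here; candidate transformations
    would be mined and verified beforehand."""
    out = []
    for v in values:
        for transform in transformations:
            rewritten = transform(v)
            if rewritten != v:   # a transformation applies: use the rewritten value
                v = rewritten
                break
        out.append(v)
    return out

if __name__ == "__main__":
    cluster = ["Mary Lee", "Lee, Mary"]
    print(standardize(cluster, [transpose_name]))   # ['Mary Lee', 'Mary Lee']
```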

    Duplicate Table Detection with Xash

    Data lakes are typically lightly curated and as such prone to data quality problems and inconsistencies. In particular, duplicate tables are common in most repositories. The goal of duplicate table detection is to identify those tables that display the same data. Comparing tables is generally quite expensive, as the order of rows and columns might differ for otherwise identical tables. In this paper, we explore the application of Xash, a hash function previously proposed for the discovery of multi-column join candidates, to the use case of duplicate table detection. With Xash, it is possible to generate a so-called super key, which acts like a Bloom filter and instantly indicates the existence of particular cell values. We show that using Xash it is possible to speed up the duplicate table detection process significantly. In comparison to SimHash and other competing hash functions, Xash results in fewer false positive candidates.
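
    The abstract only hints at how the super key works. The sketch below illustrates the Bloom-filter-like idea with a generic hash, not the actual Xash construction: each cell sets one bit, the bits are OR-ed per table, and differing keys cheaply rule out a duplicate pair regardless of row and column order. All names and parameters are assumptions.

```python
import hashlib

def cell_fingerprint(value, bits=128):
    """Map a cell value to a single bit position, independent of row/column order."""
    digest = hashlib.md5(str(value).encode("utf-8")).digest()
    return 1 << (int.from_bytes(digest[:4], "big") % bits)

def super_key(table, bits=128):
    """OR all cell fingerprints into one Bloom-filter-like integer.
    Illustrative only; the real Xash construction is more involved."""
    key = 0
    for row in table:
        for cell in row:
            key |= cell_fingerprint(cell, bits)
    return key

def may_be_duplicates(table_a, table_b):
    """Cheap pre-filter: tables with the same data have the same super key,
    so differing keys rule out a duplicate; equal keys still need verification."""
    return super_key(table_a) == super_key(table_b)

if __name__ == "__main__":
    t1 = [["Alice", 1], ["Bob", 2]]
    t2 = [[2, "Bob"], [1, "Alice"]]   # same data, rows and columns permuted
    t3 = [["Alice", 1], ["Carol", 3]]
    print(may_be_duplicates(t1, t2))  # True  -> candidate pair, verify further
    print(may_be_duplicates(t1, t3))  # False -> pruned (barring hash collisions)
```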

    Advancing the discovery of unique column combinations

    Unique column combinations of a relational database table are sets of columns that contain only unique values. Discovering such combinations is a fundamental research problem and has many different data management and knowledge discovery applications. Existing discovery algorithms are either brute force or have a high memory load and can thus be applied only to small datasets or samples. In this paper, the well-known Gordian algorithm [9] and “Apriori-based” algorithms [4] are compared and analyzed for further optimization. We greatly improve the Apriori algorithms through efficient candidate generation and statistics-based pruning methods. A hybrid solution HCA-Gordian combines the advantages of Gordian and our new algorithm HCA, and it outperforms all previous work in many situations.
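
    A minimal, level-wise (Apriori-style) discovery of unique column combinations might look like the sketch below. It mirrors only the superset pruning implied here, not HCA's candidate generation or statistics-based pruning; function names and the toy relation are assumptions.

```python
def is_unique(rows, columns):
    """A column combination is unique iff no projected value tuple repeats."""
    projected = [tuple(row[c] for c in columns) for row in rows]
    return len(projected) == len(set(projected))

def discover_minimal_uccs(rows, num_columns):
    """Level-wise search for minimal unique column combinations: candidates grow
    by one column per level, and supersets of already found UCCs are pruned."""
    minimal_uccs = []
    level = [frozenset([c]) for c in range(num_columns)]
    while level:
        next_level = set()
        for candidate in level:
            if any(ucc <= candidate for ucc in minimal_uccs):
                continue                       # superset of a known UCC: prune
            if is_unique(rows, sorted(candidate)):
                minimal_uccs.append(candidate)
            else:
                for c in range(num_columns):   # generate next-level candidates
                    if c not in candidate:
                        next_level.add(candidate | {c})
        level = next_level
    return minimal_uccs

if __name__ == "__main__":
    rows = [("a", 1, "x"), ("a", 2, "y"), ("b", 1, "x")]
    print(discover_minimal_uccs(rows, 3))
    # minimal UCCs here: {0, 1} and {0, 2} (order of the result list may vary)
```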

    SPRINT: Ranking Search Results by Paths

    Graph-structured data abounds and has become the subject of much attention in the past years, for instance when searching and analyzing social network structures. Measures such as the shortest path or the number of paths between two nodes are used as proxies for similarity or relevance [1]. These approaches benefit from the fact that the measures are determined from some context node, e.g., “me” in a social network. With SPRINT, we apply these notions to a new domain, namely ranking web search results using the link and path structure among pages. SPRINT demonstrates the feasibility and effectiveness of Searching by Path Ranks on the INTernet with two use cases: First, we re-rank intranet search results based on the position of the user’s homepage in the graph. Second, as a live proof of concept, we dynamically re-rank Wikipedia search results based on the currently viewed page: when viewing the Java software page, a search for “Sun” ranks Sun Microsystems higher than the star at the center of our solar system. We evaluate the first use case with a user study. The second use case is the focus of the demonstration and allows users to actively test our system with any combination of context page and search term.
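
    SPRINT's actual ranking measure is not spelled out in the abstract beyond path-based proxies. As a simplified illustration only, the sketch below re-ranks search results by their BFS distance from the context page in a toy link graph; the graph, the page names, and the function names are assumptions.

```python
from collections import deque

def shortest_path_lengths(graph, source):
    """Breadth-first-search distances from the context page in an unweighted link graph."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbor in graph.get(node, ()):
            if neighbor not in dist:
                dist[neighbor] = dist[node] + 1
                queue.append(neighbor)
    return dist

def rerank(results, graph, context_page):
    """Order search results by path distance from the context page (closer first);
    unreachable pages sort last."""
    dist = shortest_path_lengths(graph, context_page)
    return sorted(results, key=lambda page: dist.get(page, float("inf")))

if __name__ == "__main__":
    # Toy link graph loosely inspired by the "Sun" example in the abstract.
    graph = {
        "Java (software)": ["Sun Microsystems", "Programming language"],
        "Sun Microsystems": ["Java (software)"],
        "Solar System": ["Sun (star)"],
        "Sun (star)": ["Solar System"],
    }
    results = ["Sun (star)", "Sun Microsystems"]
    print(rerank(results, graph, "Java (software)"))
    # ['Sun Microsystems', 'Sun (star)'] when browsing the Java software page
```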